Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

12/21/2023 Dask Demo Day notebook #1

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open

12/21/2023 Dask Demo Day notebook #1

wants to merge 3 commits into from

Conversation

cisaacstern
Copy link
Owner

@cisaacstern cisaacstern commented Nov 1, 2023

Today @jacobtomlinson @rabernat and I discussed presenting Apache Beam's DaskRunner at the upcoming Dask Demo Day on 11/16/2023.

Jacob & Ryan, here is a first pass on a notebook for that presentation, which is viewable in rendered form here: https://github.com/cisaacstern/beam-dask-demo/blob/first-pass/demo.ipynb.

Does this look like it's headed in the right direction?

@cisaacstern cisaacstern changed the title First pass Notebook for 11/16/2023 Dask Demo Day Nov 1, 2023
@cisaacstern cisaacstern changed the title Notebook for 11/16/2023 Dask Demo Day 11/16/2023 Dask Demo Day notebook - first pass Nov 1, 2023
@jacobtomlinson
Copy link

jacobtomlinson commented Nov 2, 2023

This looks great! I think there are a few things we want to cover in the demo

  1. What is Beam?
  2. Why do people like Beam?
  3. How do you run Beam on Dask?

The notebook touches on all of these, but it would be great to pull back and figure out how we can very clearly communicate the answers to those questions. We only have 7 minutes or so.

@cisaacstern
Copy link
Owner Author

Thanks for the review, @jacobtomlinson! I will incorporate this feedback and ping you again once I have an updated rev. In the meantime (and related), I realized @alxmrs previously presented on this at a Dask Demo Day about a year ago. Here are some reference materials I found:

  • Support for a Dask runner apache/beam#18962 (comment) - link to Google Doc (also here) with running notes on the Dask Runner covering multiple coordination meetings. Lots of Googlers weighed in on this doc, and it includes notes on some interesting topics (e.g., making the DaskRunner portable to other languages using Beam's gRPC framework, etc.).
  • Alex's slides from Dask Demo Day last year. Lots of high-level observations relevant to your questions above in here.

@jacobtomlinson
Copy link

Yeah, that's a good point, Alex did a great overview. I'm keen to dig a little more into the specifics though (which I acknowledge is hard in a 7 minute demo). If you watch the recording you'll see Matt and I ask a bunch of questions about motivation and design, it would be nice to address those again here more explicitly.

I think it makes sense to try and recap a little of the "what and why?" from Alex's demo but then dig more into the "how?" and finish up with "what is next and what is missing?".

I love that your notebook starts a Dask cluster and then runs a Beam pipeline on it, concrete examples like that really help.

I think we also want to focus on the fact that the Beam programming model is very popular, hugely due to it's portability. Then talk about how many folks are mature in deploying and using Dask distributed clusters, Alex already mentioned HPC centres with Dask services as one example. It would be cool to make it clear that this is adding another feature to the existing Dask toolkit.

The only mature managed Beam runner is GCP Dataflow so if the Beam on Dask integration matured a little I could see Coiled justifiably claiming that they can provide the best managed Beam on AWS experience.

@cisaacstern
Copy link
Owner Author

This framing is very helpful, thanks Jacob. The video link is also great context, I've transcribed the Q&A here for reference:

  • Matt: Streaming data in Beam, how does that work with Dask Bag?
    • [A] Not implemented yet, could be batch-only runner.
  • Matt: Are futures a better interface than bag?
    • wasn't answered
  • Matt: Why? Why did we get into this? User-based? What's the motivation?
    • [A] Pangeo Forge had the problem that it wanted to compile to different execution frameworks. Beam provides this, if you are willing to adopt the programming model. Motivated to write DaskRunner to support Pangeo Forge using of Beam's flexible deployment model on Dask.
  • Jacob: Has encountered this from the other side, i.e. thinking about how to write a graph in Dask API, and run it on Dataflow. But this is the other way around: write in Beam API, run on a Dask scheduler/cluster. What is the motivation there? What does Dask give over Dataflow that makes this worth it?
    • [A] Compare to RayRunner. Why RayRunner if we have Spark, etc. A lot of HPC have Dask already set up. Dask would allow Beam (which is initially designed for cloud) to run on HPC. It's harder for scientists to run on cloud than it is for us to adapt Beam to HPC with Dask.

@alxmrs
Copy link

alxmrs commented Nov 4, 2023 via email

@cisaacstern cisaacstern changed the title 11/16/2023 Dask Demo Day notebook - first pass ~~11/16/2023~~ Dask Demo Day notebook - first pass Nov 9, 2023
@cisaacstern cisaacstern changed the title ~~11/16/2023~~ Dask Demo Day notebook - first pass TBD upcoming Dask Demo Day notebook - first pass Nov 9, 2023
@cisaacstern
Copy link
Owner Author

👋 @jacobtomlinson I'm now wondering if it may be worth pushing our Demo Day presentation until we have a little more new material to share beyond what @alxmrs already presented? I'm all for near-term milestones and not needing things to be polished, but I'm also recognizing that there has not been any notable feature development on the DaskRunner since that last presentation, so I worry about being a bit repetitive.

I also have new additional context to share, which is that I just filed apache/beam#29365. This bug represents a huge roadblock to any continuing development work on the DaskRunner insofar as it breaks testing (due to Beam's testing utils relying on GroupByKey internally).

Rather than an immediate next step being the Demo Day, what would you say to shifting gears to do a bit of async collaboration on fixing that groupby bug? Then we can circle back to this notebook once its fixed, with some new and compelling content to share: "hooray, groupby works!" (I am happy to do the heavy lifting here, but could use some help on understanding what to expect re: Dask bag partitioning, distributed shuffles, etc.)

@jacobtomlinson
Copy link

Dask Demo Day is very informal and typically undersubscribed in terms of content. Let's still go ahead and give a demo next week as there is a lot of value in repeating and expanding a little, then do a follow-up in a couple of months to show progress.

Collaborating on that issue sounds like another good step though.

@cisaacstern
Copy link
Owner Author

Sounds good! I will be offline for the rest of the week for a long weekend. I will revisit addressing the broader context questions you posed in the notebook early next week.

Do we need to officially reserve a presentation slot somehow?

@cisaacstern cisaacstern changed the title TBD upcoming Dask Demo Day notebook - first pass 11/16/2023 Dask Demo Day notebook - first pass Nov 9, 2023
@jacobtomlinson
Copy link

@cisaacstern it sounds like the Dask Demo Day has been cancelled this month in favour of the dask-expr event being hosted by Coiled and NVIDIA. The next Demo Day is the last working week in December before the holidays so that gives us a little more time to prepare.

@cisaacstern
Copy link
Owner Author

@jacobtomlinson that's great actually! I'll get those updates into the notebook in advance of the rain date. Talk soon!

@cisaacstern cisaacstern changed the title 11/16/2023 Dask Demo Day notebook - first pass 12/21/2023 Dask Demo Day notebook - first pass Dec 4, 2023
@cisaacstern cisaacstern changed the title 12/21/2023 Dask Demo Day notebook - first pass 12/21/2023 Dask Demo Day notebook Dec 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants